2016 Presidential Campaign Finance for CA

  1. Introduction
  2. Data Preparation
  3. Univariate Plots Section
  4. Univariate Analysis
  5. Bivariate Plots Section
  6. Bivariate Analysis
  7. Multivariate Plots Section
  8. Multivariate Analysis
  9. Final Plots
  10. Reflection

1. Introduction

This report analyses 2016 presidential campaign finance contributions for California. The dataset is publicly available from the Federal Election Commission. The dataset for CA contains a total of 1,304,346 contribution records.

The report contains a thorough analysis of the candidates & contributors and tries to understand which factors influence people to contribute to particular candidates.

2. Data Preparation

Load the input file (part of the main file) into an R data frame. I added one comma at the end of the header because each data line ends with a comma, so R expects one more column than the header declares.
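As an alternative to editing the header by hand, the trailing comma can be handled at read time. The following is only a minimal sketch on a three-line stand-in for the real file. The helper name `read_trailing_comma_csv` and the sample rows are assumptions made for illustration, not the code used in this report.

```r
# Tiny stand-in for the FEC file: every data row ends with a trailing comma,
# so each row has one more field than the header.
raw_lines <- c(
  "cand_nm,contbr_city,contb_receipt_amt",
  "\"Clinton, Hillary Rodham\",LARKSPUR,50,",
  "\"Trump, Donald J.\",MARIPOSA,48.33,"
)

# Hypothetical helper: read the header and the body separately.
read_trailing_comma_csv <- function(lines) {
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  body <- read.csv(text = lines[-1], header = FALSE, stringsAsFactors = FALSE)
  # The trailing comma yields one extra, empty column; keep only the named ones.
  body <- body[, seq_along(header), drop = FALSE]
  names(body) <- header
  body
}

contributions <- read_trailing_comma_csv(raw_lines)
str(contributions)
```

For the real file, the same idea applies with `readLines()` on the file path in place of `raw_lines`.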

## 'data.frame':    1304346 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 25 levels "C00458844","C00500587",..: 6 6 6 15 7 15 7 7 7 6 ...
##  $ cand_id          : Factor w/ 25 levels "P00003392","P20002671",..: 1 1 1 23 12 23 12 12 12 1 ...
##  $ cand_nm          : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 23 20 23 20 20 20 4 ...
##  $ contbr_nm        : Factor w/ 231294 levels " ALERIS, ANNAKIM",..: 7753 31629 70312 175051 117924 176417 119377 119377 119411 92354 ...
##  $ contbr_city      : Factor w/ 2534 levels "","-4086",".",..: 1136 324 729 1183 323 1913 1810 1810 2402 1094 ...
##  $ contbr_st        : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : Factor w/ 143656 levels "","00000","000090272",..: 115155 71795 53864 126927 66455 139488 17327 17327 45924 58804 ...
##  $ contbr_employer  : Factor w/ 65600 levels ""," APPLE INC.",..: 38795 38795 38795 27608 4492 27608 62116 62116 41406 38795 ...
##  $ contbr_occupation: Factor w/ 28622 levels ""," ATTORNEY",..: 21476 21476 21476 12737 23942 12737 17980 17980 19679 21476 ...
##  $ contb_receipt_amt: num  50 200 5 48.3 40 ...
##  $ contb_receipt_dt : Factor w/ 732 levels "01-APR-15","01-APR-16",..: 596 446 27 486 85 564 108 132 85 446 ...
##  $ receipt_desc     : Factor w/ 74 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 2 2 1 2 1 1 1 2 ...
##  $ memo_text        : Factor w/ 428 levels "","*","* $550 REFUNDED 6/16/16",..: 40 40 40 1 4 1 4 4 4 40 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 2 2 1 2 1 1 1 2 ...
##  $ file_num         : int  1091718 1091718 1091718 1146165 1077404 1146165 1077404 1077404 1077404 1091718 ...
##  $ tran_id          : Factor w/ 1300659 levels "A000771210424405B8CF",..: 466737 466019 463401 852428 1037145 890506 1038589 1040890 1036607 466057 ...
##  $ election_tp      : Factor w/ 5 levels "","G2016","O2016",..: 4 4 4 2 4 2 4 4 4 4 ...

Data Overview - first 1000 records

2. Data Preparation (continued)

In this section I add new features or modify data definitions to enrich the overall analysis. I also categorize some features so that they become more descriptive.

Add Party Name

The function below returns the party name based on the candidate's ID. Democratic candidates are identified by name and put into a list. I will focus only on Democrats & Republicans.

I will add a column to the dataset to represent the party each candidate belongs to. For example, Trump is a Republican & Hillary is a Democrat.
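A minimal sketch of such a lookup, assuming the party is decided from the `cand_nm` column. The helper `get_party` and the short `democrat_names` list are hypothetical and deliberately incomplete. Anything not in the list is labelled Republican here, which would mislabel third-party candidates unless they are filtered out first.

```r
# Hypothetical, incomplete list of Democratic candidates (assumption).
democrat_names <- c("Clinton, Hillary Rodham", "Sanders, Bernard")

# Hypothetical helper: names in the list are Democrats, everything else is
# treated as Republican, matching the two-party focus described above.
get_party <- function(cand_nm) {
  ifelse(cand_nm %in% democrat_names, "Democrats", "Republicans")
}

sample_df <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Trump, Donald J."),
  stringsAsFactors = FALSE
)
sample_df$party <- get_party(sample_df$cand_nm)
print(sample_df)
```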

##     cmte_id   cand_id                 cand_nm         contbr_nm
## 1 C00575795 P00003392 Clinton, Hillary Rodham        AULL, ANNE
## 2 C00575795 P00003392 Clinton, Hillary Rodham CARROLL, MARYJEAN
##   contbr_city contbr_st contbr_zip contbr_employer contbr_occupation
## 1    LARKSPUR        CA  949391913             N/A           RETIRED
## 2     CAMBRIA        CA  934284638             N/A           RETIRED
##   contb_receipt_amt contb_receipt_dt receipt_desc memo_cd
## 1                50        26-APR-16                    X
## 2               200        20-APR-16                    X
##                memo_text form_tp file_num  tran_id election_tp    cnt
## 1 * HILLARY VICTORY FUND    SA18  1091718 C4768722       P2016 688524
## 2 * HILLARY VICTORY FUND    SA18  1091718 C4747242       P2016 688524
##       party pct_counts
## 1 Democrats      52.99
## 2 Democrats      52.99

Audit Zip Codes

Many zip codes are not correctly represented in the dataset. A US zip code is a five-digit numeric value, but some records have 9-digit values. After examining them manually I can see that the first 5 digits of these values are correct, so I will convert them from the 9-digit form to the 5-digit zip code.

As we are working with the California dataset only, we need to check whether the zip codes fall within the CA range. The zip code range for CA is 900XX - 961XX.

https://en.wikipedia.org/wiki/ZIP_Code#/media/File:ZIP_Code_zones.svg
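The two checks described above can be sketched as follows. The helper names `normalize_zip` and `is_ca_zip` are assumptions for illustration. The sample values mix zip codes that appear in this report with a few malformed ones.

```r
# Hypothetical helper: reduce ZIP+4 values to five digits, mark the rest as NA.
normalize_zip <- function(zip) {
  zip <- trimws(as.character(zip))
  # Keep the first five digits of 9-digit values.
  zip5 <- ifelse(grepl("^[0-9]{9}$", zip), substr(zip, 1, 5), zip)
  # Anything that is not exactly five digits is treated as missing.
  zip5[!grepl("^[0-9]{5}$", zip5)] <- NA_character_
  zip5
}

# Hypothetical helper: TRUE when the 3-digit prefix lies in 900..961.
is_ca_zip <- function(zip5) {
  prefix <- as.integer(substr(zip5, 1, 3))
  !is.na(prefix) & prefix >= 900 & prefix <= 961
}

zips <- c("949391913", "93428", "99999", "", "-4086")
zip5 <- normalize_zip(zips)
print(data.frame(raw = zips, zip5 = zip5, in_ca = is_ca_zip(zip5)))
```

Rows where `in_ca` is FALSE are the ones that would need correcting from the address.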

##       cmte_id   cand_id          cand_nm     contbr_nm        contbr_city
## 85  C00580100 P80001571 Trump, Donald J. SELLERS, RUTH MARIPOSA, CA 95338
## 766 C00580100 P80001571 Trump, Donald J.  ORTIZ, ERIKA                DPO
##     contbr_st contbr_zip       contbr_employer     contbr_occupation
## 85         CA      99999 INFORMATION REQUESTED INFORMATION REQUESTED
## 766        CA      99999                   DOJ                   DOJ
##     contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text
## 85              48.33        19-NOV-16                    X          
## 766             90.42        15-OCT-16                    X          
##     form_tp file_num     tran_id election_tp   cnt       party pct_counts
## 85     SA18  1146165 SA18.177985       G2016 86258 Republicans       6.64
## 766    SA18  1146165 SA18.106598       G2016 86258 Republicans       6.64

Get Zip codes from Google

I will correct the zip codes using the address/city. I will use the Google Maps API to fetch the zip code for the address. If Google does not return one, I will keep it as NA.

The function below gets the zip code for an address using the Google API.

##       cmte_id   cand_id          cand_nm     contbr_nm        contbr_city
## 85  C00580100 P80001571 Trump, Donald J. SELLERS, RUTH MARIPOSA, CA 95338
## 766 C00580100 P80001571 Trump, Donald J.  ORTIZ, ERIKA                DPO
##     contbr_st contbr_zip       contbr_employer     contbr_occupation
## 85         CA      95338 INFORMATION REQUESTED INFORMATION REQUESTED
## 766        CA      90040                   DOJ                   DOJ
##     contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text
## 85              48.33        19-NOV-16                    X          
## 766             90.42        15-OCT-16                    X          
##     form_tp file_num     tran_id election_tp   cnt       party pct_counts
## 85     SA18  1146165 SA18.177985       G2016 86258 Republicans       6.64
## 766    SA18  1146165 SA18.106598       G2016 86258 Republicans       6.64

Add contributor gender to the dataset

Guess the gender of the contributor from the name using the gender package.
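Gender inference works on first names, while `contbr_nm` stores names as "LAST, FIRST". A minimal sketch of the extraction step in base R follows. The helper `extract_first_name` is an assumption, and the call into the gender package is left out because its exact usage in this report is not shown.

```r
# Hypothetical helper: pull a lower-case first name out of "LAST, FIRST ..." strings.
extract_first_name <- function(contbr_nm) {
  # Drop everything up to and including the first comma.
  given <- trimws(sub("^[^,]*,", "", contbr_nm))
  # Names without a comma have no recognisable first name.
  given[!grepl(",", contbr_nm, fixed = TRUE)] <- NA_character_
  # Keep only the first token (stops at a space or a period).
  tolower(sub("[ .].*$", "", given))
}

contbr_nm <- c("AULL, ANNE", "CARROLL, MARYJEAN", "SELLERS, RUTH", " ALERIS, ANNAKIM")
first_names <- extract_first_name(contbr_nm)
print(first_names)
```

The resulting first names would then be passed to the gender lookup.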

Add employment status as summary

Summarise the employment status as “Employed” or “Not-Employed”, as derived from the occupation information.
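A rough sketch of how such a summary column could be derived. The category labels are the six that appear in the ANOVA output later in this report. The keyword rules and the helper `classify_employment` are assumptions and will not reproduce the report's exact grouping.

```r
# Hypothetical keyword rules (assumption); later rules override earlier ones.
classify_employment <- function(occupation, employer) {
  occ <- toupper(trimws(occupation))
  emp <- toupper(trimws(employer))
  status <- rep("EMPLOYED", length(occ))
  status[grepl("SELF", emp) | grepl("SELF", occ)] <- "SELF-EMPLOYED"
  status[grepl("OWNER|CEO|FOUNDER", occ)] <- "BUSINESS"
  status[occ == "RETIRED" | emp == "RETIRED"] <- "RETIRED"
  status[grepl("NOT EMPLOYED|UNEMPLOYED|HOMEMAKER|STUDENT", occ)] <- "NOT EMPLOYED"
  status[occ %in% c("", "N/A", "INFORMATION REQUESTED")] <- "NO DATA"
  status
}

occupation <- c("RETIRED", "ATTORNEY", "INFORMATION REQUESTED", "OWNER", "NOT EMPLOYED")
employer <- c("N/A", "APPLE INC.", "INFORMATION REQUESTED", "SELF-EMPLOYED", "N/A")
employment_status <- classify_employment(occupation, employer)
print(data.frame(occupation, employer, employment_status))
```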

Working data – First 1000 records

Plot Missing data Summary

3. Univariate Plots Section

In this section each feature will be analysed independently.

Plot donation counts for the parties

Plot donation counts for the candidates

Most of the donations went to Democrats. Some candidates received so few donations that their counts are not even visible in the bar plot.

Log transformation of the previous bar plot, to display candidates who received few donations.

Plot donation counts by genders

Plot donation counts by months

The chart below shows the total donation counts by month for 2016. I am not including dates earlier than 2016.
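The month labels can be derived from `contb_receipt_dt`, which stores dates such as "26-APR-16". A minimal sketch follows, assuming two-digit years all belong to the 2000s. The helper `parse_receipt_date` is hypothetical, and the month is matched manually so the result does not depend on the locale.

```r
# Hypothetical helper: parse "DD-MON-YY" strings into Date values.
parse_receipt_date <- function(x) {
  parts <- strsplit(toupper(x), "-", fixed = TRUE)
  field <- function(i) vapply(parts, "[", character(1), i)
  day <- as.integer(field(1))
  month_index <- match(field(2), toupper(month.abb))
  year <- 2000L + as.integer(field(3))  # assumes all dates fall in 20xx
  as.Date(sprintf("%04d-%02d-%02d", year, month_index, day))
}

receipt_dt <- c("26-APR-16", "19-NOV-16", "01-APR-15")
dates <- parse_receipt_date(receipt_dt)

# Keep 2016 only, then build labels such as "Apr-16".
dates_2016 <- dates[format(dates, "%Y") == "2016"]
month_label <- paste0(month.abb[as.integer(format(dates_2016, "%m"))], "-16")
print(table(month_label))
```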

Histogram of contributed amounts

The histogram shows that there are a lot of contribution amounts less than 0, and that most contributions are less than 300. The plot below shows outliers starting at 1500, but there are a lot of donations above 1500 and I will not remove them, as they might represent donations from people with strong socio-economic status.

Checking only amounts less than 0. There is not much clarity on these data. Most probably they correspond to refunds made to donors.

Distribution of contributed amounts

The distribution of contribution amounts does not follow a normal distribution. The vertical line shows the median.

Histogram of contributed amounts – Log Scale

Distribution of contributed amounts - Log Scale

After applying the log transformation the distribution still does not follow a normal distribution. The vertical line shows the median.

Histogram of contributed amounts – SQRT Scale

Distribution of contributed amounts – SQRT scale

After applying the SQRT transformation the distribution still does not follow a normal distribution. The vertical line shows the median.

Distribution of contributed amounts – Cuberoot scale

After applying the cube-root transformation the distribution still does not follow a normal distribution. The vertical line shows the median.

Distribution of contributed amounts – sqrt-log scale

After applying the custom (sqrt-log) transformation the distribution still does not follow a normal distribution.
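One way to compare the transformations numerically is sample skewness, which is 0 for a symmetric distribution. The sketch below uses a handful of illustrative amounts, not the full dataset, and a hand-written `skewness` helper (an assumption). Non-positive amounts are dropped because the log is undefined for them. The custom sqrt-log scale is omitted since its definition is not given here.

```r
# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Illustrative amounts only (assumption), including one refund.
amounts <- c(-25, 5, 40, 48.33, 50, 90.42, 200, 2700)
positive <- amounts[amounts > 0]

transformed <- list(
  raw      = positive,
  log10    = log10(positive),
  sqrt     = sqrt(positive),
  cuberoot = positive^(1 / 3)
)
skew_by_scale <- sapply(transformed, skewness)
print(round(skew_by_scale, 2))
```

A value closer to 0 means a more symmetric shape, which is necessary but not sufficient for normality.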

Plot Donation counts by Employment Status

Contributions by File_Num

Add county information along with the zipcode so that the analysis can be done at a region level

Plot Contributions by zip codes

Density Plot of Contributions by Counties

The plot below shows the donations aggregated by county, so that we cover a larger region rather than concentrating on zip codes, each of which appears as nothing more than a point on the map.

The plot shows how the donations were spread across counties. Coastal & southern counties accounted for most of the donations.

4. Univariate Analysis

Structure of dataset

There are a lot of donations received across California. The dataset provides some interesting features such as candidate, contributor information (name, employment/occupation), contribution information (amount, date), and city & zip code.

Some features were generated from the available information, such as the gender of the contributor, the party receiving the contribution, lat-long information from the zip code, and employment status.

Other observations about the dataset:

  • Most people contributed to Democrats
  • There are a lot of contributions with a value less than 0, which seems a little odd. Maybe these are refunds (as seen from the receipt description) for earlier donations, or maybe change given back when amounts were paid in cash. Most of the contributors with an amount less than 0 are retired. There is no clarity on this and it would be better to ignore these records.

What are the main features of interest in the dataset ?

A number of features are provided. The main features are party, candidate, donation timeline, city/zip, contributor gender & employment status.

I believe a contribution depends mainly on the contributor. People can have various expectations when electing a candidate, and those expectations differ from person to person. For example, unemployed people can have different expectations from someone who runs a business, so the employment status of a person is a key factor. Gender can also have some influence, as female voters may have a different set of expectations from male voters.

Party is a key factor, as there is a general tendency for some portion of the population to vote for the party rather than give much importance to the candidate.

A candidate's background can have a big influence on people. Hillary Clinton had a strong political background, which can lead people to vote for her. Place of origin can also have some influence: Hillary is from Chicago, Illinois, and won in Illinois. On the other hand, Donald Trump has an extensive background in business and not much in politics. This can lead a lot of people connected with business or self-employment to vote for him.

Cities can have some influence, as some areas generally lean towards a particular party.

Donation timing is also very important, as people tend to donate much more when the election is close.

Did you create any new variables from existing variables in the dataset?

I created a number of variables, such as employment status, gender, party and latitude-longitude, to gain more insight during the analysis.

What is the goal of the analysis ?

The analysis focuses on the factors that can influence a person to contribute to, and maybe later vote for, a party or candidate.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The contribution amount was right-skewed and looked non-normal. I tried various transformations to bring it closer to a normal distribution, but the transformed distributions were non-normal too.

I have not excluded refunds so far, as I did not have any clarity on them. I will analyse the negative contributions in detail. To remove outliers I remove contributions over the allowable limit for an individual ($2,700).

I did not reformat the data, as the dataset is already in a tidy format.

5. Bivariate Plots Section

Pair-Plot of the variables

Plot donations received by candidates

Democrats and Hillary received the largest number of donations, and the amount they received is significantly higher than that of Republicans & Trump.

Contributions summary towards parties

Republicans have a much higher average & median contribution.

## # A tibble: 2 x 7
##         party avg_contribution max_contribution min_contribution
##        <fctr>            <dbl>            <dbl>            <dbl>
## 1   Democrats         103.8390            10000           -10500
## 2 Republicans         179.6183            10800           -10000
## # ... with 3 more variables: median_contribution <dbl>, IQR <dbl>,
## #   count <int>
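A per-party summary of this shape can be produced in base R as sketched below. The toy rows are illustrative and the helper `summarise_amounts` is an assumption. The report's own output is a tibble, so it was presumably built with a different toolchain.

```r
# Illustrative rows only (assumption), not the real dataset.
toy <- data.frame(
  party = c("Democrats", "Democrats", "Democrats",
            "Republicans", "Republicans", "Republicans"),
  contb_receipt_amt = c(50, 200, 5, 48.33, 90.42, 2700),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the statistics reported in the table above.
summarise_amounts <- function(x) {
  c(avg = mean(x), max = max(x), min = min(x),
    median = median(x), IQR = IQR(x), count = length(x))
}

# One row per party, one column per statistic.
by_party <- split(toy$contb_receipt_amt, toy$party)
party_summary <- t(sapply(by_party, summarise_amounts))
print(round(party_summary, 2))
```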

Contribution in negative

A number of contributions have a negative value. This probably means a refund of an earlier contribution (a cancellation) or change given back to the person. For example, a person wishes to pay $96, pays $100, and receives $4 in return. We cannot ignore these negative values.

If the dataset is examined carefully, we can see a field named “file_num”, and some of its values might be used only for negative amounts.

Analyse negative donation amounts by Party

I can see that the Republicans received more negative amounts.

Analyse negative donation amounts by contribution employment

I can see that a good number of employed people have negative donation amounts. As around 60% of the records are from employed people, we can expect a good number of the negative amounts to come from them. There is also a very high number of retired people with negative donations.

Contributions by parties

It is clear that Democrats received many more contributions.

There is not much correlation between party & contribution amounts.

##      x    y
## x 1.00 0.18
## y 0.18 1.00
## 
## n= 1299440 
## 
## 
## P
##   x  y 
## x     0
## y  0
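Because party is categorical, a correlation with the amount only makes sense once party is coded as a number, which gives a point-biserial correlation. The sketch below uses base R's `cor` and `cor.test` on illustrative rows. This may differ from the function that produced the output above.

```r
# Illustrative rows only (assumption), not the real dataset.
party <- c("Democrats", "Democrats", "Democrats",
           "Republicans", "Republicans", "Republicans")
amount <- c(50, 200, 5, 48.33, 90.42, 2700)

# Code party as 0/1 so Pearson's r becomes the point-biserial correlation.
is_republican <- as.numeric(party == "Republicans")
r <- cor(is_republican, amount)
r_test <- cor.test(is_republican, amount)

print(round(r, 2))
print(r_test$p.value)
```

With more than a million records even a weak correlation gets a p-value near zero, so the size of r matters more than its significance.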

Contribution by gender

Contributions do not vary much by gender. Males have slightly higher average, median, and IQR values than females.

Boxplot showing how the donation ranges differ by gender. The orange dot shows the average donation per gender. Males have a slightly higher average contribution than females.

T-Test to find if male contributions are different than female contributions

I will run t-tests on all the contributions within 0 - 1500 for males & females.

H0 : The average male contribution is the same as the average female contribution. H1 : The average male contribution is higher than the average female contribution.

I will first run the t-test on all the contribution values for these 2 groups. Next I will draw 1000 random samples of 100 data points each and run the t-test on the sample means.

  • T-Test on all data points. The T-statistic is much larger than the T-critical value, so we can reject the null and say males have the higher average.
## [1] "T-Critical value ::  2.326"
## 
##  Welch Two Sample t-test
## 
## data:  tx and ty
## t = 61.02, df = 1110200, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.33410 16.35185
## sample estimates:
## mean of x mean of y 
##  83.55116  67.70819
  • T-Test on 1000 random sample means. The T-statistic is much larger than the T-critical value, so we can reject the null and say males have the higher average.
## [1] "T-Critical value ::  2.336"
## 
##  Welch Two Sample t-test
## 
## data:  mean_sampx and mean_sampy
## t = 7.1768, df = 394.26, p-value = 3.567e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.577796 16.805004
## sample estimates:
## mean of x mean of y 
##  80.75612  67.56472
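The two-step procedure can be sketched as below on simulated amounts (an assumption, generated with `rlnorm`, not the real contributions). The critical values printed above are consistent with a one-sided test at a significance level of 0.01, so the sketch uses `alternative = "greater"`. The outputs above report the two-sided alternative instead. The helper `sample_means` is hypothetical.

```r
set.seed(42)

# Simulated stand-ins for male and female contribution amounts (assumption).
male <- round(rlnorm(5000, meanlog = 3.6, sdlog = 1.1), 2)
female <- round(rlnorm(5000, meanlog = 3.4, sdlog = 1.1), 2)
male <- male[male <= 1500]
female <- female[female <= 1500]

# Step 1: Welch t-test on all values, one-sided to match H1.
alpha <- 0.01
full_test <- t.test(male, female, alternative = "greater")
t_critical <- qt(1 - alpha, df = full_test$parameter)

# Step 2: 1000 random samples of 100 points each, then test the sample means.
sample_means <- function(x, n_samples = 1000, sample_size = 100) {
  replicate(n_samples, mean(sample(x, sample_size)))
}
means_test <- t.test(sample_means(male), sample_means(female),
                     alternative = "greater")

print(c(t_statistic = unname(full_test$statistic), t_critical = unname(t_critical)))
print(means_test$p.value)
```

The null is rejected when the t-statistic exceeds the critical value, or equivalently when the p-value is below `alpha`.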

Contributions by employment

Employed people account for most of the donations.

Boxplot showing how the donation ranges differ by employment status. The white dot shows the average donation per category.

ANOVA model to test whether the contribution means vary significantly from one employment group to another

The ANOVA model summary shows an F-value of 1227 and a p-value that is almost zero. We can reject the null of equal means across the employment categories. At least one employment category must have a mean different from the rest, or maybe all the groups have different means.

To test this I ran a pairwise t-test. The contribution means are not significantly different for the “RETIRED” & “SELF-EMPLOYED” groups (p = 1.000) or for the “NOT EMPLOYED” & “NO DATA” groups (p = 0.092); every other pair of groups has significantly different contribution means.

## -------- Anova Model Summary --------
##                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## employments       5 9.647e+08 192931113    1227 <2e-16 ***
## Residuals   1299434 2.044e+11    157264                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -------- Pairwise T-Test --------
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  contributions and employments 
## 
##               BUSINESS SELF-EMPLOYED EMPLOYED RETIRED NOT EMPLOYED
## SELF-EMPLOYED < 2e-16  -             -        -       -           
## EMPLOYED      < 2e-16  < 2e-16       -        -       -           
## RETIRED       < 2e-16  1.000         < 2e-16  -       -           
## NOT EMPLOYED  < 2e-16  < 2e-16       < 2e-16  < 2e-16 -           
## NO DATA       < 2e-16  4.5e-15       < 2e-16  < 2e-16 0.092       
## 
## P value adjustment method: bonferroni
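For reference, an ANOVA followed by Bonferroni-adjusted pairwise t-tests can be run in base R as sketched below. The variable names `contributions` and `employments` follow the output above. The data and the group means are simulated (an assumption), so the numbers will not match.

```r
set.seed(1)

# Same six labels as in the output above; values are simulated (assumption).
groups <- c("BUSINESS", "SELF-EMPLOYED", "EMPLOYED",
            "RETIRED", "NOT EMPLOYED", "NO DATA")
employments <- factor(rep(groups, each = 50), levels = groups)
group_means <- c(150, 100, 120, 100, 90, 50)
contributions <- rnorm(length(employments),
                       mean = rep(group_means, each = 50), sd = 40)

# One-way ANOVA: H0 is that all group means are equal.
anova_model <- aov(contributions ~ employments)
print(summary(anova_model))

# Pairwise t-tests with pooled SD and Bonferroni adjustment.
pairwise <- pairwise.t.test(contributions, employments,
                            p.adjust.method = "bonferroni")
print(round(pairwise$p.value, 4))
```

The ANOVA only says that some difference exists. The pairwise matrix shows which pairs differ, and any adjusted p-value above the chosen threshold marks a pair whose means cannot be told apart.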

Contribution by Month

## # A tibble: 12 x 4
##     month  count total_contribution avg_contribution
##    <fctr>  <int>              <dbl>            <dbl>
##  1 Sep-16 117826        15044181.29        127.68134
##  2 Oct-16 159024        14951078.57         94.01775
##  3 Aug-16  94721        12909329.24        136.28793
##  4 Mar-16 139137         9893500.57         71.10618
##  5 Jul-16 106401         9834207.14         92.42589
##  6 Feb-16  98076         9576561.99         97.64430
##  7 Jun-16  88560         8960716.46        101.18244
##  8 May-16 101987         8342635.99         81.80097
##  9 Apr-16 126669         7564606.98         59.71948
## 10 Nov-16  77640         5950332.96         76.64004
## 11 Jan-16  47635         5781032.37        121.36102
## 12 Dec-16   1039          -67914.73        -65.36548

Correlation of contribution amounts to months.

##      x    y
## x 1.00 0.03
## y 0.03 1.00
## 
## n= 1299440 
## 
## 
## P
##   x  y 
## x     0
## y  0

Contribution by City - Top 10 Cities

## # A tibble: 10 x 4
##      contbr_city      n total_contribution avg_contribution
##            <chr>  <int>              <dbl>            <dbl>
##  1   LOS ANGELES 102475           16118142        157.28853
##  2 SAN FRANCISCO  90595           15259760        168.43931
##  3     SAN DIEGO  45935            3811645         82.97911
##  4     PALO ALTO  11989            3224361        268.94327
##  5       OAKLAND  33159            3126871         94.29933
##  6 BEVERLY HILLS   6789            3120463        459.63510
##  7      BERKELEY  23070            2835139        122.89289
##  8  SANTA MONICA  14455            2825475        195.46695
##  9      SAN JOSE  30550            2384275         78.04501
## 10    SACRAMENTO  23734            2330008         98.17172

Contribution by Filenum

There is a set of file numbers against which contributions are received. I believe these files correspond to particular accounts maintained by candidates/parties. They are not of much importance. I can see that some file numbers are used only for negative amounts, or more specifically refunds.

## # A tibble: 175 x 4
##    file_num     n total_contribution avg_contribution
##       <int> <int>              <dbl>            <dbl>
##  1  1079053   334         -632034.80       -1892.3198
##  2  1111847   627         -511418.07        -815.6588
##  3  1073684    89         -124109.20       -1394.4854
##  4  1146285   279          -51697.72        -185.2965
##  5  1057130    79          -40326.00        -510.4557
##  6  1088217    17          -34254.00       -2014.9412
##  7  1096256    57          -32460.00        -569.4737
##  8  1145574    11          -27101.00       -2463.7273
##  9  1107453    15          -25500.00       -1700.0000
## 10  1073714     9          -21599.00       -2399.8889
## # ... with 165 more rows

Plot contribution on map by party

Democrats clearly received many more contributions than Republicans, and every region contributed to them. In some areas Republicans (red bubbles) dominated, but in most areas across CA the Democrats (blue bubbles) received more contributions.

When the map is split by party both look similar, but Democrats have slightly higher density than Republicans in certain areas and overall.

Plot contribution on map by gender

There are all kinds of bubbles spread across CA. I can see slightly more pink bubbles, which means there are more female contributors.

Density Plot of Contributions across Counties by Party

Democrats have much higher density in coastal & southern counties. For the rest of the counties, slightly stronger color density indicates slightly more donations to Democrats.

Density Plot of Contributions across Counties by Gender

There is not much difference visible from the density plot.

Heatmap of donations towards candidates by gender

The heatmap clearly shows that Hillary received more donations than any other candidate.

Heatmap of donations towards candidates by employment

6. Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

The people of California lean towards Democrats & Hillary. Most of the contributions were made to Hillary.

Males & females have almost the same contribution range. However, males have much higher outliers. There are slight differences by employment status. Business people contributed more and unemployed people contributed least. As most of the contributors are employed, the histogram for employed contributors is similar to the histogram of the whole dataset.

Most of the contributions are from the San Francisco, San Jose, San Diego, Sacramento & Los Angeles areas. There is not much difference by gender or party. The contributions vary from month to month.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is not much I can observe from the other features. I can only see that there are a number of file numbers against which contributions are made. Some file numbers appear to be used for refunds, as they account only for negative amounts. Apart from that, it's interesting to see that the donations are distributed across CA, and most of the donations are received from cities like San Diego, San Francisco, LA etc.

What was the strongest relationship you found?

I could not see any strong relationship. Only employment status has some influence on the contributions compared with the other features.

7. Multivariate Plots Section

Contribution by gender & party

Both genders have similar distributions, but females made many more donations in the range of 0 - 30 dollars. Females contributed more to Democrats than males did, and males contributed slightly more to Republicans.

Contribution by gender & party boxplot

The boxplot below shows that males contributed slightly more to Republicans than females did. Orange points show the mean.

The plot below shows the total contributions per party by gender & employment status.

Contribution by employment status & party boxplot

Republicans received slightly higher-valued donations across all employment categories. This explains why their average donation is higher.

Contribution analysis by gender & employment status

There are a lot of donations, and more of them went to Democrats. Overall, the Republican average is higher than the Democratic average, and within each party the male average is higher than the female average.

## # A tibble: 4 x 6
## # Groups:   party [?]
##         party gender  count total_contributions avg_contribution
##        <fctr> <fctr>  <int>               <dbl>            <dbl>
## 1   Democrats   male 458626            53735920        117.16719
## 2   Democrats female 588338            57597421         97.89852
## 3 Republicans   male 114493            24023907        209.82861
## 4 Republicans female  75582            12205294        161.48413
## # ... with 1 more variables: median_contribution <dbl>

Males donated slightly more than females on average across all income categories. This applies to both Republicans & Democrats.

Density Plot of Contributions across Counties by Gender & Party

Male donations have higher density for Republicans. For Democrats there is not much visible difference in the density of male & female donations.

Total Contributions by party & timeline

Democrats received many more donations, as can be clearly seen from the summary tables below.

Republicans received donations throughout the months, without much difference from one month to the next.

– Democrats

## # A tibble: 6 x 3
## # Groups:   party [1]
##       party  month total_contributions
##      <fctr> <fctr>               <dbl>
## 1 Democrats Sep-16            12328140
## 2 Democrats Oct-16            12319218
## 3 Democrats Aug-16            10338443
## 4 Democrats Mar-16             8458088
## 5 Democrats May-16             8124709
## 6 Democrats Jul-16             7731627

– Republicans

## # A tibble: 6 x 3
## # Groups:   party [1]
##         party  month total_contributions
##        <fctr> <fctr>               <dbl>
## 1 Republicans Jun-15             3232755
## 2 Republicans Dec-15             2983170
## 3 Republicans Sep-15             2888453
## 4 Republicans Sep-16             2716041
## 5 Republicans Oct-16             2631860
## 6 Republicans Aug-16             2570886

The plot below shows the contribution totals over time, grouped by party & gender.

Statistics summary of the donations by timeline

This plot shows the average contributions made per month to each party for both males & females.

– Average: Average donations were almost the same for both parties till August. Afterwards the average increased significantly for Republicans.

– Median: Republicans have much higher medians than Democrats. This suggests that economically stronger sections gave more support to Republicans.

– Total: Democrats clearly received many more donations than Republicans, and females donated more to Democrats.

Linear model for predicting contribution amount based on the main features

## 
## Call:
## lm(formula = I(contb_receipt_amt) ~ I(party) + gender + contbr_employment_status + 
##     month, data = dfh)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10518.9    -83.0    -52.6    -12.8  10754.1 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                             96.7556     2.9283  33.042
## I(party)Republicans                     44.2966     0.9513  46.562
## genderfemale                           -20.1739     0.6048 -33.356
## contbr_employment_statusSELF-EMPLOYED  -43.8213     4.2022 -10.428
## contbr_employment_statusEMPLOYED       -20.1728     2.8001  -7.204
## contbr_employment_statusRETIRED        -50.6549     2.8536 -17.751
## contbr_employment_statusNOT EMPLOYED   -56.9198     2.9750 -19.133
## contbr_employment_statusNO DATA       -100.0324     3.1477 -31.779
## monthAug-16                             77.5338     1.3933  55.647
## monthDec-16                           -137.2432    10.0252 -13.690
## monthFeb-16                             32.9957     1.3619  24.227
## monthJan-16                             55.5664     1.7195  32.316
## monthJul-16                             32.5743     1.3469  24.185
## monthJun-16                             42.3675     1.4038  30.180
## monthMar-16                             11.1666     1.2416   8.994
## monthMay-16                             25.0229     1.3456  18.596
## monthNov-16                             19.5455     1.4691  13.305
## monthOct-16                             37.6233     1.2209  30.817
## monthSep-16                             71.7516     1.3100  54.773
##                                       Pr(>|t|)    
## (Intercept)                            < 2e-16 ***
## I(party)Republicans                    < 2e-16 ***
## genderfemale                           < 2e-16 ***
## contbr_employment_statusSELF-EMPLOYED  < 2e-16 ***
## contbr_employment_statusEMPLOYED      5.83e-13 ***
## contbr_employment_statusRETIRED        < 2e-16 ***
## contbr_employment_statusNOT EMPLOYED   < 2e-16 ***
## contbr_employment_statusNO DATA        < 2e-16 ***
## monthAug-16                            < 2e-16 ***
## monthDec-16                            < 2e-16 ***
## monthFeb-16                            < 2e-16 ***
## monthJan-16                            < 2e-16 ***
## monthJul-16                            < 2e-16 ***
## monthJun-16                            < 2e-16 ***
## monthMar-16                            < 2e-16 ***
## monthMay-16                            < 2e-16 ***
## monthNov-16                            < 2e-16 ***
## monthOct-16                            < 2e-16 ***
## monthSep-16                            < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 313.4 on 1117187 degrees of freedom
## Multiple R-squared:  0.01228,    Adjusted R-squared:  0.01226 
## F-statistic: 771.4 on 18 and 1117187 DF,  p-value: < 2.2e-16

Random Forest Model to predict depending on features donations will be made to which party

## 
## Call:
##  randomForest(formula = party ~ contb_receipt_amt + month + gender +      contbr_employment_status, data = dfh, mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.47%
## Confusion matrix:
##             Democrats Republicans class.error
## Democrats      156278        6039  0.03720498
## Republicans     14894       22789  0.39524454
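The error rates in this output follow directly from the confusion matrix, and the arithmetic can be checked as below. The counts are copied from the output above. The majority-class baseline is an added comparison, not part of the report's model.

```r
# Confusion matrix from the output above: rows are actual, columns predicted.
confusion <- matrix(
  c(156278, 6039,
    14894, 22789),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("Democrats", "Republicans"),
                  predicted = c("Democrats", "Republicans"))
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(confusion) / rowSums(confusion)

# Overall (OOB) error: share of all records that were misclassified.
oob_error <- 1 - sum(diag(confusion)) / sum(confusion)

# Baseline: error of always predicting the larger class.
baseline_error <- 1 - max(rowSums(confusion)) / sum(confusion)

print(round(class_error, 4))
print(c(oob_error = oob_error, baseline_error = baseline_error))
```

Always predicting Democrats would already be wrong only about 18.8% of the time, so the 10.47% OOB error is best read against that baseline rather than against 50%.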

Random Forest Model Error

8. Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In general, there was not much relationship between the variables and the contributions. There was a general trend of people contributing more towards Democrats, and this was strengthened by the timing of the election: people donated much more as the election approached. Republicans, on the other hand, received almost the same amount of contributions month by month.

Were there any interesting or surprising interactions between features?

The interesting observations I found are listed below.

  • Republicans received much higher contributions on average. The same pattern is seen across all income categories and across the different employment categories.

  • Males contributed more on average to Republicans than females did.

  • Republican contributions did not increase with time; they remained roughly constant. This is surprising, as in general people tend to contribute more as the election comes closer.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried fitting a linear model to estimate the contribution amount from features like party, gender, timing and employment status, but it did not fit well: the R-squared is only about 0.012, so the model explains roughly 1.2% of the variance in the contribution amount.
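The weak fit can be checked directly from the figures in the summary output. The sketch below only re-derives the adjusted R-squared from the reported R-squared and degrees of freedom; the variable names are illustrative and no model is refitted.

```r
# Values copied from the lm() summary output above
r2   <- 0.01228      # Multiple R-squared
p    <- 18           # number of predictors (numerator DF of the F-statistic)
df_r <- 1117187      # residual degrees of freedom
n    <- df_r + p + 1 # number of observations used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_r

# Share of the variance in contribution amount explained by the model, in percent
pct_explained <- 100 * r2

print(round(adj_r2, 5))
print(round(pct_explained, 2))
```

With over a million observations, every coefficient is highly significant, yet the model still leaves almost all of the variance unexplained. Significance and predictive power are different things here.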

Next, I tried to create a random forest model to predict the party based on the contribution amount, gender and other features. It did a pretty decent job of predicting Democrats (class error of about 3.7%), but it does not do well when predicting Republicans (class error of about 39.5%). Overall, an error rate of around 10% is not bad. Building a full prediction model is out of scope for this project.
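To put the error rate in context, the reported figures can be recomputed from the confusion matrix and compared with a naive baseline that always predicts the majority class. This is a minimal sketch using only the counts printed in the randomForest output; the variable names are illustrative.

```r
# Confusion matrix copied from the randomForest output above
# (rows = actual party, columns = predicted party)
cm <- matrix(c(156278,  6039,
                14894, 22789),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Democrats", "Republicans"),
                             predicted = c("Democrats", "Republicans")))

total <- sum(cm)

# Overall error: off-diagonal counts over all out-of-bag predictions
oob_error <- (total - sum(diag(cm))) / total

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(cm) / rowSums(cm)

# Baseline: error of always predicting the majority class (Democrats)
baseline_error <- 1 - max(rowSums(cm)) / total

print(round(100 * oob_error, 2))
print(round(class_error, 4))
print(round(100 * baseline_error, 2))
```

Because Democrats make up roughly 81% of the rows in this matrix, always predicting "Democrats" would already give an error of about 19%. The model's error of about 10.5% is an improvement on that, but most of the remaining error comes from Republican donations being classified as Democrat.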


9. Final Plots

Plot One

This plot shows the total contributions per party by gender & employment status.

Description One

It can be clearly seen that males contributed more in total, even though the number of contributions from females is higher. Employed people donated roughly 90% or more of the total. For Democrats, males & females contributed about equally, but for Republicans males contributed slightly more than females.


Plot Two

This plot shows the total contributions per party by gender & timing.

Description Two

The 2016 US Presidential election was held on 8th November. In general, people tend to become more involved as the election comes closer. The bars above show the total contributions received per month for the whole of 2016. It can be clearly seen that for Democrats the total contribution increases steadily from one month to the next. This trend is seen for both males & females. For Republicans there is no such trend: the total contributions are quite uneven, and although there is some increase, it is not comparable with the increase for Democrats.


Plot Three

This plot shows the moving average of the average contribution per month towards each party, for both males & females.
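As an illustration of the smoothing step, a trailing moving average over monthly averages could be computed as in the sketch below. The monthly values are made up for the example and the 3-month window is an assumption; the window actually used for the plot is not stated in this report.

```r
# Hypothetical monthly average contribution amounts for one party/gender group
# (illustrative numbers only, not taken from the dataset)
monthly_avg <- c(Jan = 100, Feb = 120, Mar = 110, Apr = 130, May = 150, Jun = 140)

# 3-month trailing moving average; the window size of 3 is an assumption
window <- 3
moving_avg <- as.numeric(
  stats::filter(monthly_avg, rep(1 / window, window), sides = 1)
)
names(moving_avg) <- names(monthly_avg)

# The first (window - 1) months are NA because a full window is not yet available
print(moving_avg)
```

A trailing window (`sides = 1`) uses only the current and previous months, so each point on the smoothed line depends only on data available up to that month.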

Description Three

It can be clearly seen that males have a slightly higher average donation to both parties. Another interesting observation is that for Republicans there is a marked increase in the average donation, whereas for Democrats there is a decrease after August. This is exactly the opposite of what we observed earlier.

Earlier we looked at total contributions only and did not take average contributions into consideration. I think a lot of people donated to Democrats with a slightly smaller average amount, which collectively made the total donations much larger than those of Republicans.
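The relationship between count, average and total can be made concrete with a toy example. The data frame below is entirely hypothetical and only reuses the column names `party` and `contb_receipt_amt` that appear in the model output; it shows how a party with a lower average donation can still collect a higher total.

```r
# Toy data (hypothetical, not from the dataset): many small donations to one
# party, fewer but larger donations to the other
toy <- data.frame(
  party = c(rep("Democrats", 8), rep("Republicans", 2)),
  contb_receipt_amt = c(rep(50, 8), rep(150, 2))
)

# Count, mean and total contribution per party
summary_by_party <- do.call(rbind, lapply(split(toy, toy$party), function(d) {
  data.frame(party = d$party[1],
             n     = nrow(d),
             avg   = mean(d$contb_receipt_amt),
             total = sum(d$contb_receipt_amt))
}))

print(summary_by_party)
```

Since total = count × average, the larger number of donations outweighs the lower average in this toy case, which is the same pattern described above for the two parties.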

We don’t have enough data to understand which section of people donated to which party. But it can be assumed that fewer people donated to Republicans, with a higher average donation. They probably came from the economically stronger section of society. Unfortunately, we don’t have any data on contributors’ socio-economic status to analyse this any further.


10. Reflection

This campaign contributions dataset was an interesting one. I used the provided features as-is and derived a few additional features for the analysis. The most significant features that could be used to predict the contribution amount are party, gender, employment status & contribution timeline. These features play some role in predicting the contribution amount.

I tried to use a linear model to see how useful these features are for predicting the contribution amount. It did not do a good job.

I also tried to create a random forest model to predict the party receiving a contribution from the contribution amount, gender, employment status & timeline, and it did not do badly at making the predictions.

It seemed that the total contribution counts and total contribution amounts were correlated with cities and were most dense in cities like San Francisco, San Diego and LA. Zip codes that did not seem to be in cities, or that were in rural areas, did not show much density or the bigger bubbles that represent large donation amounts or a high number of donations.

Some interesting facts were seen: total contribution counts and amounts were much higher towards Democrats, but with a lower average. This may suggest that Republicans were mostly supported by people with a stronger socio-economic status.

Some problems were encountered because names were stored in a dirty format. It was not possible to programmatically extract first names for everyone, and the gender package used was unable to match some names. Due to this, some names were missed and some genders were not predicted. Another problem was encountered while categorizing the employment status using the contributor’s occupation. There was also some missing data that could not be interpreted because it did not carry much meaningful information.
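As a rough illustration of the first-name extraction problem, the sketch below parses names in the "LAST, FIRST" style suggested by the `contbr_nm` values in the data summary. The function `extract_first_name` and the list of titles it strips are hypothetical and are not the code used for this report; names it cannot parse are returned as `NA`, which is one way such records end up without a predicted gender.

```r
# Hypothetical helper: pull a first name out of "LAST, FIRST MIDDLE" strings.
# Returns NA when the name has no comma or nothing usable after it.
extract_first_name <- function(contbr_nm) {
  nm <- trimws(toupper(contbr_nm))
  has_comma <- grepl(",", nm, fixed = TRUE)

  # Keep everything after the first comma
  given <- trimws(sub("^[^,]*,", "", nm))

  # Drop a leading title if present (this list of titles is an assumption)
  given <- sub("^(MR|MRS|MS|DR)\\.?\\s+", "", given)

  # First whitespace-delimited token is treated as the first name
  first <- sub("\\s.*$", "", given)
  first[!has_comma | first == ""] <- NA_character_
  first
}

examples <- c(" ALERIS, ANNAKIM", "SMITH, JOHN A.", "DOE, MR. JAMES", "NOCOMMA")
print(extract_first_name(examples))
```

Only the first example is taken from the data summary; the other inputs are invented to show a middle initial, a title and an unparseable name.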

I would like to suggest some ideas for the dataset, as including a few more features would make it easier to understand and would enrich the analysis. Features like Age, Gender, Income range and Dependents could be included. Some people might not want to share information like Income range or Age, but including these would enrich the dataset. Location information like City, Urban, Semi-urban & Rural could also be included.

There is a lot of further analysis that could be done if these features were available.

As the analysis of the dataset shows, most people contributed to Democrats & Hillary Clinton. Presumably these people also voted for Democrats, and not surprisingly the Democrats won CA by a very good margin. So the general trend seen in the dataset is consistent with the result.